Schedule for session


Why R?


The traditional approach to reports / analysis in the sciences


Disadvantages of this process


What is R?

Learning how to use R

RStudio and R Markdown

  • RStudio is a powerful user interface that helps you get better control of your analysis.
  • It is also completely free.
  • It comes in both a desktop version and a server version (in the cloud).
  • You can write your entire paper/report (text, code, analysis, graphics, etc.) all in R Markdown.
  • If you need to update any of your code, re-knitting the R Markdown document regenerates your plots and analysis output and creates an updated PDF/HTML file.
  • No more copy-and-paste!

What is Markdown?


What does it look like?

  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  

What is R Markdown?

  • “Literate programming”
  • Embed R code in a Markdown document
  • Renders textual output along with graphics

```{r chunk_name}
library(ggplot2)  # provides qplot()
x <- rnorm(1000)
length(x)
qplot(x, bins = 10, 
      fill = I("orange"), 
      color = I("black"))
```
## [1] 1000


But then I have to learn R…


Data summarization/visualization


Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  na.omit() %>% select(-year, -origin)
str(object = pdx_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    52808 obs. of  14 variables:
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  1 8 28 526 541 549 559 602 606 618 ...
##  $ dep_delay: num  96 13 -2 -4 1 24 -1 -3 6 -2 ...
##  $ arr_time : int  235 548 800 1148 911 907 916 1204 746 1135 ...
##  $ arr_delay: num  70 -4 -23 15 4 12 -9 7 3 -30 ...
##  $ carrier  : chr  "AS" "UA" "US" "UA" ...
##  $ tailnum  : chr  "N508AS" "N37422" "N547UW" "N813UA" ...
##  $ flight   : int  145 1609 466 229 1569 649 796 1573 406 1650 ...
##  $ dest     : chr  "ANC" "IAH" "CLT" "IAH" ...
##  $ air_time : num  194 201 251 217 130 122 125 203 87 184 ...
##  $ distance : num  1542 1825 2282 1825 991 ...
##  $ hour     : num  0 0 0 5 5 5 5 6 6 6 ...
##  $ minute   : num  1 8 28 26 41 49 59 2 6 18 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:527] 145 146 147 148 149 150 189 304 310 313 ...
##   .. ..- attr(*, "names")= chr [1:527] "145" "146" "147" "148" ...

Random sample

We randomly select 3000 flights from this set of 52808 flights.

set.seed(2016)
pdx_rs <- pdx_flights %>% sample_n(3000)
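
Setting the seed before sampling is what makes a random sample reproducible: anyone who reruns the code gets the same rows. A minimal sketch in base R, using a small made-up data frame (`toy_flights` and its values are assumptions for illustration, not the real flight data):

```r
# Toy stand-in for the flight data (hypothetical values, for illustration only)
toy_flights <- data.frame(id = 1:100, dep_delay = (1:100) %% 7 - 3)

# Fix the seed, then draw 10 row indices at random
set.seed(2016)
rows_a <- sample(nrow(toy_flights), size = 10)
sample_a <- toy_flights[rows_a, ]

# Resetting the same seed reproduces exactly the same sample
set.seed(2016)
rows_b <- sample(nrow(toy_flights), size = 10)
sample_b <- toy_flights[rows_b, ]

# TRUE when both draws picked the same rows in the same order
identical(sample_a, sample_b)
```

Without the second `set.seed()` call, the two draws would almost certainly differ.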

Question of interest

Does a statistically significant difference exist in the mean departure delays for airlines departing PDX?


Using visualizations to get a hint

What type of plot will help us visualize this?

  • Explanatory variable: categorical

  • Response variable: continuous

    • Side-by-side boxplot

Mindmap to assist

Coggle Diagram


Multiple Means (ANOVA)

Coggle Diagram for ANOVA


Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")


Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)


Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))


Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")


Data summarizing


Get airline names

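library(knitr)  # provides kable()
data("airlines", package = "pnwflights14")
# Attach the full airline names to the per-carrier summary of the sample
pdx_join <- inner_join(x = pdx_summary, y = airlines, by = "carrier")
kable(pdx_join)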
carrier  Mean Delay   Median Delay  name
-------  -----------  ------------  ----------------------
AA       23.2268908   -2.0          American Airlines Inc.
AS        0.9781977   -5.0          Alaska Airlines Inc.
B6        5.2826087   -3.0          JetBlue Airways
DL       -0.0369128   -3.0          Delta Air Lines Inc.
F9        4.3783784   -3.0          Frontier Airlines Inc.
HA       -4.3333333   -5.0          Hawaiian Airlines Inc.
OO        4.1436266   -4.0          SkyWest Airlines Inc.
UA        8.6853933   -2.0          United Air Lines Inc.
US        4.5902778   -3.5          US Airways Inc.
VX        2.7391304   -4.0          Virgin America
WN       13.2713816    2.0          Southwest Airlines Co.

Data analysis


Hypothesis test

Assuming conditions are met…

pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
##               Df  Sum Sq Mean Sq F value             Pr(>F)    
## carrier       10  103041   10304    8.96 0.0000000000000111 ***
## Residuals   2989 3437331    1150                               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of results


Our friend the \(p\)-value

  • The \(p\)-value resulting from our analysis is about \(1.11 \times 10^{-14}\), which is effectively 0. It is the probability of obtaining an \(F\) statistic of 8.960177 or greater, assuming that the mean departure delays for all carriers are the same (that is, that the null hypothesis is true).

  • This small \(p\)-value leads us to reject the null hypothesis in favor of the alternative: at least one carrier has a mean departure delay that differs from the others.
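
The numbers in the ANOVA table can be checked by hand: the \(F\) statistic is the ratio of the two mean squares, and the \(p\)-value is the upper-tail area of the \(F\) distribution with 10 and 2989 degrees of freedom. A minimal sketch in base R, using the rounded sums of squares printed in the table (so the result should agree with the reported \(F\) only to about two decimal places; the variable names are made up for illustration):

```r
# Values copied from the printed ANOVA table (rounded as shown there)
ss_carrier  <- 103041    # sum of squares between carriers
ss_residual <- 3437331   # sum of squares within carriers (residuals)
df_carrier  <- 10
df_residual <- 2989

# Mean square = sum of squares divided by its degrees of freedom
ms_carrier  <- ss_carrier / df_carrier
ms_residual <- ss_residual / df_residual

# F statistic and its upper-tail probability under the null hypothesis
f_stat  <- ms_carrier / ms_residual
p_value <- pf(f_stat, df1 = df_carrier, df2 = df_residual, lower.tail = FALSE)

f_stat   # close to the 8.96 reported in the table
p_value  # extremely small, but not exactly 0
```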


Reproducible research


“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”

  • Roger Peng, Johns Hopkins

Full worked-out example

A full worked-out example of this analysis is available here as an HTML file to view. The corresponding R Markdown file is available here.


How did we do?

We collected a random sample of the actual data on all 2014 flights departing PDX. Does a difference actually exist in the average departure delays for carriers in our population (all 2014 flights departing PDX)?

pdx_summary <- pdx_rs %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_summary)

Answer

pdx_full_summary <- pdx_flights %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_full_summary)
carrier  Mean Delay   Median Delay
-------  -----------  ------------
AA       13.0708625   -2
AS        0.9418523   -5
B6        5.9677926   -3
DL        2.5678412   -3
F9        8.4546125   -3
HA       -0.8027397   -5
OO        4.2595904   -4
UA        7.3794427   -2
US        1.5259545   -3
VX        6.2477477   -4
WN       12.1458352    1
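
One way to judge how well the sample did is to subtract the population mean from the sample mean for each carrier. A minimal sketch in base R, with the means copied from the two summary tables in these slides and rounded to two decimals (the data frame and column names are made up for illustration):

```r
# Mean departure delays (minutes) copied from the two kable tables, rounded
delays <- data.frame(
  carrier     = c("AA", "AS", "B6", "DL", "F9", "HA", "OO", "UA", "US", "VX", "WN"),
  sample_mean = c(23.23, 0.98, 5.28, -0.04, 4.38, -4.33, 4.14, 8.69, 4.59, 2.74, 13.27),
  pop_mean    = c(13.07, 0.94, 5.97, 2.57, 8.45, -0.80, 4.26, 7.38, 1.53, 6.25, 12.15),
  stringsAsFactors = FALSE
)

# How far off was the sample for each carrier?
delays$difference <- delays$sample_mean - delays$pop_mean

# Sort so the largest gaps (in absolute value) come first
delays <- delays[order(-abs(delays$difference)), ]
delays

# Carrier whose sample mean missed the population mean by the most
worst <- delays$carrier[1]
worst
```

In this sketch the largest gap belongs to AA, whose sample mean is roughly 10 minutes above its population mean; most other carriers are within a few minutes.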

Homework problems


Thanks!


cismay@reed.edu



sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_0.9.6         knitr_1.13              ggplot2_2.1.0           dplyr_0.4.3.9001       
## [5] pnwflights14_0.1.0.9000 revealjs_0.6.1         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6 R6_2.1.2        
##  [6] highr_0.6        stringr_1.0.0    plyr_1.8.3       tools_3.3.0      grid_3.3.0      
## [11] gtable_0.2.0     DBI_0.4-1        htmltools_0.3.5  lazyeval_0.1.10  yaml_2.1.13     
## [16] assertthat_0.1   digest_0.6.9     tibble_1.0-3     formatR_1.4      evaluate_0.9    
## [21] labeling_0.3     stringi_1.0-1    scales_0.4.0